Abstract OpenAIRE, the Open Access Infrastructure for Research in 
Europe, comprises a database of all EC FP7 and H2020 funded research 
projects, including metadata of their results (publications and datasets). 
These data are stored in an HBase NoSQL database, post-processed, 
and exposed as HTML for human consumption, and as XML through a 
web service interface. As an intermediate format to facilitate statistical 
computations, CSV is generated internally. To interlink the OpenAIRE 
data with related data on the Web, we aim at exporting them as Linked 
Open Data (LOD). The LOD export is required to integrate into the 
overall data processing workflow, where derived data are regenerated 
from the base data every day. We thus faced the challenge of identifying 
the best-performing conversion approach. We evaluated the performance 
of creating LOD by a MapReduce job on top of HBase, by mapping the 
intermediate CSV files, and by mapping the XML output. 


1 Introduction 

The European Commission emphasizes open access as a key tool to bring together 
people and ideas in a way that catalyses science and innovation. More 
than ever before, there is a recognized need for digital research infrastructures 
for all kinds of research outputs, across disciplines and countries. OpenAIRE, the 
Open Access Infrastructure for Research in Europe (http://www.openaire.eu), 
(1) manages scientific publications and associated scientific material via repository 
networks, (2) aggregates Open Access publications and links them to research 
data and funding bodies, and (3) supports the Open Access principles via 
national helpdesks and comprehensive guidelines. 

Data related to those in the OpenAIRE information space exist in different 
places on the Web. Combining them with OpenAIRE will enable new use cases. 
For example, understanding changes of research communities or the emergence 
of scientific topics not only requires metadata about publications and projects, 
as provided by OpenAIRE, but also data about events such as conferences as 
well as a knowledge model of research topics and subjects (cf. pQ). 

The availability of data that is free to use, reuse and redistribute (i.e. open 
data) is the first prerequisite for analysing such information networks. However, 
the diverse data formats and means to access or query data, the use of duplicate 
identifiers, and the heterogeneity of metadata schemas pose practical limitations 
on reuse. Linked Data, based on the RDF graph data model, is now increasingly 
accepted as a lingua franca to overcome such barriers [2]. 

The University of Bonn is coordinating the effort of publishing the OpenAIRE 
data as Linked Open Data (LOD) and linking it to related datasets in the rapidly 
growing LOD Cloud. This effort is further supported by the Athena Research 
and Innovation Center and CNR-ISTI. Besides data about scientific events and 
subject classification schemes, relevant data sources include public sector information 
(e.g., to find research results based on the latest employment statistics, or 
to answer questions such as ‘how do the EU member states’ expenses for health 
research compare to their health care spending?’) and open educational resources 
(‘how soon do emergent research topics gain wide coverage in higher 
education?’). 

Concrete steps towards this vision are (1) mapping the OpenAIRE data 
model to suitable standard LOD vocabularies, (2) exporting the objects in the 
OpenAIRE information space as a LOD graph and (3) facilitating integration 
with related LOD graphs. Expected benefits include 

— enabling semantic search over the outputs of European research projects, 

— simplifying the way the OpenAIRE data can be enriched by third-party 
services, and consumed by interested data or service providers, 

— facilitated outreach to related open content and open data initiatives, and 

— enriching the OpenAIRE information space itself by exploiting how third 
parties will use its LOD graph. 

The specifically tailored nature of the OpenAIRE infrastructure, its large 
amount of data (covering more than 11 million publications) and the frequent 
updates of the more than 5000 repositories from which the data is harvested pose 
high requirements on the technology chosen for mapping the OpenAIRE data to 
LOD. We therefore compared in depth three alternative mapping methods, one 
for each source format in which the data are available: HBase, CSV and XML. 

Section 2 introduces the OpenAIRE data model and the three existing data 
sources. Section 3 presents our specification of the OpenAIRE data model as 
an RDF vocabulary. Section 4 establishes requirements for the mapping. Section 5 
presents the state of the art for each of the three mapping approaches. 
Section 6 explains our three implementations. In section 7 we evaluate them in 
comparison, with regard to different metrics induced by the requirements. Section 8 
reviews work related to our overall approach (comparing mappings and 
producing research LOD). Section 9 concludes and outlines future work. 

2 Input Data 

The data model of the OpenAIRE infrastructure is specified as an entity relationship 
model (ERM) [3,4] with the following entity categories: 

— Main entities (cf. figure 1): Result (Publication or Dataset), Person, Organization, 
Projects, and DataSource (e.g. Repository, Dataset Archive or 
CRIS). Instances of these are continuously harvested from data providers. 

— Structural entities representing complex information about main entities: 
Instances (of a Result in different DataSources), WebResources, Titles, 
Dates, Identities, and Subjects. 

— Static entities, whose metadata do not change over time: Funding. E.g., 
once a funding agency has opened a funding stream, it remains static. 

— Linking entities represent relationships between entities that carry further 
metadata; e.g., an entity of type Person_Result whose property ranking has 
the value 1 indicates the first author. 



Figure 1: OpenAIRE Data Model: core entities and relationships 


So far, the OpenAIRE data have been available in three formats: HBase, 
CSV and XML. 


2.1 HBase 

Currently, the master source of all OpenAIRE data is kept in HBase, a column 
store based on HDFS (Hadoop Distributed File System). HBase was introduced 
in 2012 when data integration efforts pushed the original PostgreSQL database 
to its limits: joins became inefficient and parallel processing, as required for 
deduplication, was not supported. Each row of the HBase table has a unique row 
key and stores a main entity and a number of related linked entities. The attribute 
values of the main entities are stored in the <family>:body column, where 
the <family> is named after the type of the main entity, e.g., result, person, 

5 https://issue.openaire.research-infrastructures.eu/projects/ 
project, organization or datasource. The attribute values of linked entities, 
indicating the relationship between main entities, are stored in dedicated column 
families <family>:<column>, where <family> is the class of the linked entity 
and <column> is the row key of the target entity. Both directions of a link are 
represented. Cell values are serialized as byte arrays according to the Protocol 
Buffers [5] specification; for example: 


message Person {
  optional Metadata metadata = 2;
  message Metadata {
    optional StringField firstname = 1;
    repeated StringField secondnames = 2;
    optional Qualifier nationality = 9; ...
  }
  repeated Person coauthors = 4;
}


The following table shows a publication and its authors. For readability, we 
abbreviated row keys and spelled out key-value pairs rather than showing their 
binary serialization. 


RowKey               Column                              Value

50|...001::39b9...   result:body                         resulttype=“publication”;
                                                         title=“The Data Model of ...”;
                                                         dateofacceptance=“2012-01-01”;
                                                         language=“en”;
                                                         publicationDate=“2012”;
                                                         publisher=“Springer”;
                     ...hasAuthor:30|...001::9897...     ranking=1;
                     ...hasAuthor:30|...001::ef29...     ranking=2;

30|...001::9897...   person:body                         firstname=“Paolo”;
                                                         lastname=“Manghi”;
                     ...isAuthorOf:50|...001::39b9...    ranking=1;

30|...001::ef29...   person:body                         firstname=“Nikos”;
                                                         lastname=“Houssos”;
                     ...isAuthorOf:50|...001::39b9...    ranking=2;


2.2 CSV 

CSV files aid the computation of statistics on the OpenAIRE information space. 
HBase is a sparse key-value store designed for data with little or no internal 
relations. Therefore, it is impossible to run complex queries directly on top of HBase, 
for example a query to find all results of a given project. It is thus necessary to 
transform the data to a relational representation, which is comprehensible for 
statistics tools and enables effective querying. Via an intermediate CSV 
representation, the data is imported into a relational database, which is queried for 
computing the statistics. 

In this generation process, each main entity type (result, project, person, 
organization, datasource) is mapped to a CSV file of the same name, which 
is later imported into a relational database table. Each single-valued attribute 
of an entity (id, title, publication year, etc.) becomes a field in the entity’s 
table. Multi-valued attributes, such as the publication languages of a result, 
are mapped to relation tables (e.g. result_languages) that represent a one-to-many 
relation between entity and attributes. Linked entities, e.g. the authors 
of a result, are represented similarly. As the data itself includes many special 
characters, for example commas in publication titles, the OpenAIRE CSV files 
use ! as a delimiter and wrap cell values into leading and trailing hashes: 












#dedup_wf_001::39b91277f9a2c25b1655436ab996a76b#!#The Data Model of the OpenAIRE 
Scientific Communication e-Infrastructure#!#null#!#null#!#Springer#!#null#!#null 
#!#null#!#null#!#2012#!#2012-01-01#!#Open Access#!#Open Access#!#Access#!#null#!# 
0#!#null#!#nulloai:http://helios-eie.ekt.gr:!#publication#10442/13187oai:pumaoai. 
isti.cnr.it:cnr.isti/cnr.isti/2012-A2-040#!#1#! 

Finally, using CSV has the advantage that existing tools such as Sqoop can 
be used, thus reducing the need to develop and maintain custom 
components on the OpenAIRE production system. 
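
The delimiter convention described in this section can be unpacked with a few lines of code. The following is a minimal Python sketch, not the code used in OpenAIRE; the function name `parse_openaire_csv_line` and the treatment of the literal `null` as a missing value are assumptions made for illustration.

```python
def parse_openaire_csv_line(line):
    """Split one record that uses '!' as delimiter and '#' as value wrapper."""
    record = line.strip()
    # Drop a dangling delimiter at the end of the record, if present.
    if record.endswith("!"):
        record = record[:-1]
    # Remove the outermost pair of hashes, then split on the '#!#' boundary.
    if record.startswith("#") and record.endswith("#"):
        record = record[1:-1]
    cells = record.split("#!#")
    # Assumption: the literal string 'null' marks a missing value.
    return [None if cell == "null" else cell for cell in cells]
```

Commas inside a title survive because the comma plays no role as a separator here. A cell that itself contained the sequence `#!#` would still be split incorrectly by this sketch.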

2.3 XML 

OpenAIRE features a set of HTTP APIs for exporting metadata as XML for 
easy reuse by web services. These APIs use an XML Schema implementation of 
the OpenAIRE data model called OAF (OpenAIRE Format), where each record 
represents one entity. There is one API for searching, and one for bulk access. For 
example, the listing below comes from 
http://api.openaire.eu/search/publications?openairePublicationID=dedup_wf_001::39b91277f9a2c25b1655436ab996a76b 
and shows the metadata of a publication that has been searched for. 

<oaf:result>
  <title schemename="dnet:dataCite_title" classname="main title"
         schemeid="dnet:dataCite_title" classid="main title">The Data Model of the
         OpenAIRE Scientific Communication e-Infrastructure</title>
  <dateofacceptance>2012-01-01</dateofacceptance>
  <publisher>Springer</publisher>
  <resulttype schemename="dnet:result_typologies" classname="publication"
              schemeid="dnet:result_typologies" classid="publication"/>
  <language schemename="dnet:languages" classname="English"
            schemeid="dnet:languages" classid="eng"/>
  <format>application/pdf</format>
</oaf:result>

The API for bulk access uses OAI-PMH (The Open Archives Initiative Protocol 
for Metadata Harvesting) to publish metadata; its endpoint 
is at http://api.openaire.eu/oai_pmh. The bulk access API lets developers fetch 
the whole XML files step by step. For our experiments, we obtained the XML 
data directly from the OpenAIRE server, as an uncompressed Hadoop SequenceFile 
comprising 500 splits of ~300 MB each. 

3 Implementing the OpenAIRE Data Model in RDF 

As the schema of the OpenAIRE LOD we specified an RDF vocabulary by 
mapping the entities of the ER data model to RDF classes and its attributes 

7 http://api.openaire.eu/ 

8 https://www.openaire.eu/schema/0.2/doc/oaf-0.2.html 

9 http://www.openarchives.org/OAI/openarchivesprotocol.html 
10 http://wiki.apache.org/hadoop/SequenceFile 



and relationships to RDF properties. We reused suitable existing RDF vocabularies 
identified by consulting the Linked Open Vocabularies search service 
and studying their specifications. Reused vocabularies include Dublin Core for 
general metadata, SKOS for classification schemes and CERIF for research 
organizations and activities. We linked new, OpenAIRE-specific terms to reused 
ones, e.g., by declaring Result a superclass of http://purl.org/ontology/bibo/Publication 
and http://www.w3.org/ns/dcat#Dataset. 

We keep the URIs of the LOD resources (i.e. entities) in the 
http://lod.openaire.eu/data/ namespace. We modelled them after the HBase row keys. In 
OpenAIRE, these are fixed-length identifiers of the form 
{typePrefix}|{namespacePrefix}::md5hash. typePrefix is a two-digit code, 10, 20, 30, 40 or 50, 
corresponding to the main entity types datasource, organization, person, project and result. 
The namespacePrefix is a unique 12-character identifier of the data source of the 
entity. For each row, md5hash is computed from the entity attributes. The resulting 
URIs look like 
http://lod.openaire.eu/data/result/dedup_wf_001::39b91277f9a2c25b1655436ab996a76b. 
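
The step from a row key to a resource URI can be sketched as follows. This is an illustrative Python sketch, not the project's implementation. It assumes that a vertical bar separates the type prefix from the rest of the row key, and that the path segment of the URI is the entity type name, as the example URI suggests. The function name `row_key_to_uri` is hypothetical.

```python
BASE = "http://lod.openaire.eu/data/"

# Two-digit type prefixes as listed in the text.
TYPE_PREFIXES = {
    "10": "datasource",
    "20": "organization",
    "30": "person",
    "40": "project",
    "50": "result",
}


def row_key_to_uri(row_key):
    """Turn '{typePrefix}|{namespacePrefix}::{md5hash}' into a LOD URI."""
    # Assumption: '|' separates the type prefix from the local identifier.
    type_prefix, sep, local_id = row_key.partition("|")
    if not sep or type_prefix not in TYPE_PREFIXES:
        raise ValueError("unexpected row key: %r" % row_key)
    return BASE + TYPE_PREFIXES[type_prefix] + "/" + local_id
```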

The following listing shows our running example in RDF/Turtle syntax. 

@prefix oad: <http://lod.openaire.eu/data/> .
@prefix oav: <http://lod.openaire.eu/vocab#> .
# further prefixes omitted; see http://prefix.cc for their standard bindings.

oad:result/...001::39b9... rdf:type oav:Result, bibo:Publication;
    dcterms:title "The Data Model of the OpenAIRE Scientific Communication e-Infrastructure"@en ;
    dcterms:dateAccepted "2012-01-01"^^xsd:date ;
    dcterms:language "en";
    oav:publicationYear 2012 ;
    dcterms:publisher "Springer";
    dcterms:creator oad:person/...001::9897..., oad:person/...001::ef29... .

oad:person/...001::9897... rdf:type foaf:Person;
    foaf:firstName "Paolo"; foaf:lastName "Manghi";
    oav:isAuthorOf oad:result/...001::39b9... .

oad:person/...001::ef29... rdf:type foaf:Person;
    foaf:firstName "Nikos"; foaf:lastName "Houssos";
    oav:isAuthorOf oad:result/...001::39b9... .


4 Requirements 

In cooperation with the other technical partners in the OpenAIRE2020 consortium, 
most of whom had been working on the infrastructure in previous projects 
for years, we established the following requirements for the LOD export: 

11 http://lov.okfn.org 

12 http://www.w3.org/2004/02/skos/ 

13 Common European Research Information Format; see http://www.eurocris.org/ 
cerif/main-features-cerif 



R1 The LOD output must follow the vocabulary specified in section 3. 

R2 The LOD must be generated from one of the three existing data sources, to 
avoid extra pre-processing costs. 

R3 The mapping to LOD should be maintainable w.r.t. planned extensions of the 
OpenAIRE data model (such as linking publications and data to software) 
and the evolution of linked data vocabularies. 

R4 The mapping to LOD should be orchestrable together with the other existing 
OpenAIRE data provision workflows, always exposing a consistent view on 
the information space, regardless of the format. 

R5 To enable automatic and manual checks of the consistency and correctness 
of the LOD before its actual publication, it should be made available in 
reasonable time in a private space. 

To prepare an informed decision on the preferred input format to use for the 
LOD export, we realised one implementation for each of HBase, CSV and XML. 

5 Technical State of the Art 

For each possible approach, i.e. mapping HBase, CSV or XML to RDF, we briefly 
review the state of the art to give an overview of technology we could potentially 
reuse or build on, whereas section [8] reviews work related to our overall approach. 
We assess reusability w.r.t. the OpenAIRE-specific requirements stated above. 

HBase, being a sparse, distributed and multidimensional persistent sorted 
map, provides dynamic control over the data format and layout. Several 
works have therefore explored the suitability of HBase as a triple store for 
semi-structured and sparse RDF data. Sun et al. adopted the idea of the Hexastore 
indexing technique for storing RDF in HBase [5]. Khadilkar et al. focused on a 
distributed RDF storage framework based on HBase and Jena to gain scalability [7]. 
Others have provided MapReduce implementations to process SPARQL queries 
over RDF stored in HBase m- 

We are only aware of one work on exposing data from column-oriented stores 
as RDF. Kiran et al. provide a method for generating a SPARQL endpoint, i.e. 
a standardized RDF query interface, on top of HBase m- They map tables to 
classes, rows to resources, and columns to properties. Their approach does not 
scale well with increasing numbers of HBase entries, as their results show that the 
time taken to map HBase data to RDF is in hours for a few million rows m- 
CSV is widely used for publishing tabular data m • The CSV on the Web 
W3C Working Group provides technologies for data-dependent applications 
on the Web working with CSV. Several existing implementations, including that 
of Anything To Triples (any23), map CSV to a generic RDF representation. 
Customizable mappings are more suitable for our purpose. In Tarql (Transformation 
SPARQL), one can define such mappings in SPARQL; Tabels (Tabular 

14 http://www.w3.org/2013/05/lcsv-charter.html 

15 http://any23.apache.org 

16 https://tarql.github.io 



Cells) and Sparqlify use domain-specific languages similar to SPARQL. Tabels 
provides auxiliary machinery to filter and compare data values during the 
transformation process. Sparqlify is mainly designed to map relational databases 
to RDF but also features the sparqlify-csv module. 

XML is used for various data and document exchange purposes. As for 
CSV→RDF, there are generic and domain-specific XML→RDF approaches. 
Breitling implemented a direct, schema-independent transformation, which retains 
the XML structure [13]. Turning this generic RDF representation into a 
domain-specific one requires post-processing on the RDF side, e.g., transformations 
using SPARQL CONSTRUCT queries. Moreover, the current 
version of Breitling’s approach is implemented in XSLT 1.0, which does not 
support streaming and is therefore not suitable for the very large inputs of the 
OpenAIRE setting. Klein uses RDF Schema to map XML elements and attributes 
to RDF classes and properties [14]. It does not automatically interpret the 
parent-child relation between two XML elements as a property between two resources, 
but a lot of such relationships exist in the OpenAIRE XML. XSPARQL 
can transform XML to RDF and back by combining the XQuery and SPARQL 
query languages [15]; authoring mappings requires good knowledge of both. By 
supporting XQuery’s expressive mapping constructs, XSPARQL requires access 
to the whole XML input via its DOM (Document Object Model), which results 
in heavy memory consumption. A subset of XQuery is suitable for streaming 
but supported neither by the XSPARQL implementation nor by the free version 
of the Saxon XQuery processor required to run XSPARQL. 


6 Implementation 


As the only existing HBase→RDF implementation does not scale well (cf. section 5), 
we decided to follow the MapReduce paradigm for processing massive 
amounts of data in parallel over multiple nodes. We implemented a single MapReduce 
job. Its mapper reads the attributes and values of the OpenAIRE entities 
from their protocol buffer serialization and thus obtains all information required 
for the mapping to RDF. Hence no reducer is required. The map-only approach 
performs well because it avoids the computationally intensive shuffle phase. RDF 
subjects are generated from row keys, predicates and objects from attribute 
names and cell values or, for linked entities, from column families/qualifiers. 
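
The reason no reducer is needed can be seen in a small sketch of the per-row logic. The Python below is an illustration only. It is neither the Java MapReduce job nor a Hadoop API call, it assumes the cell values have already been decoded from Protocol Buffers into dictionaries, and the `row:`, `attr:` and `link:` prefixes are placeholders rather than the vocabulary of section 3.

```python
def map_row(row_key, columns):
    """Yield (subject, predicate, object) triples for one HBase-like row.

    `columns` maps 'family:qualifier' to a dict of already decoded
    attribute values (the Protocol Buffers decoding step is omitted).
    """
    subject = "row:" + row_key
    for column, attributes in columns.items():
        # Split on the first colon only; row keys themselves contain '::'.
        family, _, qualifier = column.partition(":")
        if qualifier == "body":
            # Attributes of the main entity become literal-valued triples.
            for name, value in attributes.items():
                yield (subject, "attr:" + name, value)
        else:
            # Any other qualifier is the row key of a linked entity.
            yield (subject, "link:" + family, "row:" + qualifier)
```

Every triple depends on a single row only, so rows can be processed independently on any node and the output needs no grouping or aggregation step.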

Mapping the OpenAIRE CSV→RDF is straightforward: files correspond 
to classes, columns to properties, and each row is mapped to a resource. We 
initially implemented mappings in Tarql, Sparqlify and Tabels (cf. section 5) 


17 http://idi.fundacionctic.org/tabels 

18 https://github.com/AKSW/Sparqlify 12 

19 cf. ‘Streaming in XQuery’, http://www.saxonica.com/html/documentation/sourcedocs/streaming/streamed-query.html 



and ended up preferring Tarql because of its good performance and the most 
flexible mapping language: standard SPARQL with a few extensions. As 
we map CSV→RDF, as opposed to querying CSV like RDF, we implemented 
CONSTRUCT queries, which specify an RDF template in which, for each row 
of the CSV, variables are instantiated with the cell values of given columns. 
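
The idea of instantiating a triple template once per CSV row can be imitated in a few lines of Python. This sketch is not Tarql and does not parse SPARQL. The template, the column names `id` and `publisher`, and the function `instantiate` are assumptions chosen for illustration.

```python
import re

# Hypothetical template; '?name' is bound to the cell of the column 'name'.
TEMPLATE = [
    ("oad:result/?id", "rdf:type", "oav:Result"),
    ("oad:result/?id", "dcterms:publisher", '"?publisher"'),
]

VAR = re.compile(r"\?(\w+)")


def instantiate(template, row):
    """Return the triples obtained by binding one CSV row to the template."""
    triples = []
    for pattern in template:
        variables = {v for term in pattern for v in VAR.findall(term)}
        # As in SPARQL CONSTRUCT, a triple with an unbound variable is dropped.
        if any(row.get(v) is None for v in variables):
            continue
        triples.append(
            tuple(VAR.sub(lambda m: row[m.group(1)], term) for term in pattern)
        )
    return triples
```

A row without a publisher still yields its `rdf:type` triple, which mirrors how a CONSTRUCT template omits only those triples whose variables are unbound.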

To enable easy maintenance of XML→RDF mappings by domain experts, 
and efficient mapping of large XML inputs, we implemented our own approach. 
It employs a SAX parser and thus supports streaming. Our mapping language 
is based on RDF triple templates and on the XPath language for addressing 
content in XML. XPath expressions in the subjects or objects of RDF triple templates 
indicate where in the XML they obtain their values from. To keep XPath 
expressions simple and intuitive, we allow them to be ambiguous, e.g., by saying 
that oaf:result/publisher/text() (referring to the text content of the publisher element 
of a result) maps to the dcterms:publisher property of an oav:Result, and 
that oaf:result/dateofacceptance/text() maps to dcterms:dateAccepted. In theory, 
any combination of publisher and dateofacceptance elements would match such 
a pattern; however, in reality only those nodes that have the shortest distance 
in the XML document tree represent attributes of the same OpenAIRE entity. 
XML Filters [16] efficiently restrict the XPath expressions to such combinations. 
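
To illustrate why a SAX-based mapper keeps memory usage low, here is a minimal Python sketch built on the standard `xml.sax` module. It is not the authors' mapping tool: it matches complete element paths instead of ambiguous XPath expressions, it does not implement XML Filters, and the class `PathMapper`, the function `map_xml` and the placeholder subject `_:record` are hypothetical.

```python
import io
import xml.sax


class PathMapper(xml.sax.ContentHandler):
    """Collect triples for mapped element paths while streaming."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules      # element path -> RDF property
        self.path = []          # stack of currently open elements
        self.text = []          # character data of the current element
        self.triples = []

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        key = "/".join(self.path)
        value = "".join(self.text).strip()
        if key in self.rules and value:
            self.triples.append(("_:record", self.rules[key], value))
        self.path.pop()
        self.text = []


def map_xml(xml_bytes, rules):
    """Parse XML from a file-like object and return the mapped triples."""
    handler = PathMapper(rules)
    xml.sax.parse(io.BytesIO(xml_bytes), handler)
    return handler.triples
```

Only the stack of open element names and the text of the current element are held in memory. The list of triples is kept here for simplicity; a mapper for very large inputs would write each triple out as soon as it is produced.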

7 Evaluation 

7.1 Comparison Metrics 

The time it takes to transform the complete OpenAIRE input data to RDF is 
the most important performance metric (requirement R4). The main memory 
usage of the transformation process is important because OpenAIRE2020 envisages 
the development of further services sharing the same infrastructure, including 
deduplication, data mining to measure research impact, classification of 
publications by machine learning, etc. One objective metric for maintainability 
is the size of the mapping’s source code after stripping comments and compression, 
which makes the comparison ‘independent of arbitrary factors like lengths 
of identifiers and amount of whitespace’ [17]. The ‘cognitive dimensions of 
notation’ (CD) evaluation framework provides further criteria for systematically 
assessing the ‘usability of information artefacts’ [18]. The following dimensions 
are straightforward to observe here: closeness of the notation to the problem 
(here: mapping HBase/CSV/XML to RDF), terseness (here measured by code 
20 Tabels failed to handle large CSV files because it loads all the data from the CSV into 
main memory; Sparqlify works similar to Tarql but with almost doubled execution 
time (7,659 s) and more than doubled memory usage. 

21 http://www.w3.org/TR/sparql11-query/ 

22 See source code and documentation at https://github.com/allen501pc/XML2RDF 

23 http://www.w3.org/TR/xpath20/ 

24 We used tar cf - <input files> | xz -9. For HBase, we considered the part of the 
Java source code that is concerned with declaring the mapping, whereas our CSV 
and XML mappings are natively defined in high-level mapping languages. 



size; see above), error-proneness, progressive evaluation (i.e. whether one can 
start with an incomplete mapping rule and evolve it to further completeness), 
and secondary notation and escape from formalism (e.g. whether reading cues 
can be given by non-syntactic means such as indentation or comments). 
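
The compressed code size metric, which footnote 24 obtains with tar and xz -9, can be approximated in Python with the standard `tarfile` and `lzma` modules. This is a sketch under the assumption that comments have already been stripped from the input files; the function name `compressed_size` is hypothetical, and byte counts may differ slightly from the command-line tools because of archive metadata.

```python
import io
import lzma
import tarfile


def compressed_size(paths):
    """Size in bytes of an xz-compressed tar archive of the given files."""
    buffer = io.BytesIO()
    # Build an uncompressed tar archive in memory.
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for path in paths:
            archive.add(path)
    # Compress with the highest standard preset, comparable to 'xz -9'.
    return len(lzma.compress(buffer.getvalue(), preset=9))
```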

7.2 Evaluation Setup 

The HBase→RDF evaluation ran on a Hadoop cluster of 12 worker nodes 
operated by CNR. As our CSV→RDF and XML→RDF implementations 
required dependencies not yet installed there, we evaluated them locally: on a 
virtual machine on a server with an Intel Xeon E5-2690 CPU, having 3.7 GB 
memory and 250 GB disk space assigned and running Linux 3.11 and JDK 1.7. 
As we did not have a cluster available, and as the tools employed did not natively 
support parallelization, we ran the mappings from CSV and XML sequentially. 


7.3 Measurements and Observations 


The following table lists our measurements; further observations follow below. 


Objective Comparison Metrics             HBase          CSV            XML

Mapping Time (s)                         1,043          4,895          45,362
Memory (MB)                              68,000         103            130
Compressed Mapping Source Code (KB)      4.9            2.86           1.67
Number of Input Rows/Records             20,985,097     203,615,518    25,182,730
Number of Generated RDF Triples          655,328,355    654,193,273    788,953,122
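
As a reading aid for the measurements, the mapping times and triple counts can be combined into a throughput figure. The short Python sketch below is not part of the original evaluation; the numbers are copied from the measurements table, and the function name `triples_per_second` is hypothetical.

```python
# Figures copied from the measurements: (mapping time in s, generated triples).
MEASUREMENTS = {
    "HBase": (1043, 655328355),
    "CSV": (4895, 654193273),
    "XML": (45362, 788953122),
}


def triples_per_second(measurements):
    """Return the average number of generated triples per second."""
    return {
        name: triples / seconds
        for name, (seconds, triples) in measurements.items()
    }
```

This gives roughly 628,000 triples per second for HBase, 134,000 for CSV and 17,000 for XML. The rates are not directly comparable, since the HBase run used a 12-node cluster while the CSV and XML runs were sequential on a single virtual machine.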


For HBase→RDF, the peak memory usage of the cluster was 68 GB, i.e. 
~5.5 GB per worker node. No other MapReduce job was running on the cluster 
at the same time; however, the usage figure includes the memory used by the 
Hadoop framework, which schedules and monitors job execution. 

The 20 CSV input files correspond to different entities but also to relationships. 
This, plus the way multi-valued attributes are represented (cf. section 2.2), 
causes the high number of input rows. The size of all files is 33.8 GB. 

The XML→RDF memory consumption is low because of stream processing. 
The time complexity of our mapping approach depends on the number 
of rules (here: 118) and the size of the input (here: 144 GB). With the complexity 
of the XML representation, this results in an execution time of more 
than 12 hours. The size of the single RDF output file is ~91 GB. 

Regarding cognitive dimensions, the different notations expose the following characteristics; 
for lack of space we focus on selected highlights. Terseness: the high-level 
CSV→RDF and XML→RDF languages fare better than the Java code required 
for HBase→RDF. Also, w.r.t. closeness, they enable more intuitive descriptions 
of mappings. As the CSV→RDF mappings are based on SPARQL, which uses 
the same syntax for RDF triples as the Turtle RDF serialization, they look 

25 https://issue.openaire.research-infrastructures.eu/projects/openaire/wiki/Hadoop_Clusters#section-3 







closest to RDF. Error-proneness: Syntactically correct HBase→RDF Java code 
may still define a semantically wrong mapping. In Tarql’s CSV→RDF mappings, 
many types of syntactic and semantic errors can be detected easily. Progressive 
evaluation: one can start with an incomplete Tarql CSV→RDF 
mapping rule and evolve it towards completeness. Secondary notation: Tarql 
and Java support flexible line breaks, indentation and comments, whereas our 
current XML→RDF mapping implementation requires one (possibly long) line 
per mapping rule. Overall, this strongly suggests that CSV→RDF is the most 
maintainable approach. 

8 Related Work 

Comparisons of different approaches of mapping data to RDF have mainly been 
carried out for relational databases as a source da?]. Similarly to our evaluation 
criteria, the reference comparison framework of the W3C RDB2RDF Incubator 
Group covers mapping creation, representation and accessibility, and support 
for data integration m Hert et al. compared different RDB2RDF mapping 
languages w.r.t. syntactic features and semantic expressiveness [22]. 

For other linked datasets about research, we refer to the ‘publication’ and 
‘government’ sectors of the LOD Cloud, which comprises, e.g., publication data¬ 
bases such as DBLP, as well as snapshots of funding databases such as CORDIS. 
From this it can be seen that OpenAIRE is a more comprehensive data source 
than those published as LOD before. 

9 Conclusion and future work 

We have mapped a recent snapshot of the OpenAIRE data to RDF. A preliminary 
dump as well as the definitions of the mappings are available online 
at http://tinyurl.com/OALOD. Mapping from HBase is fastest, whereas mapping 
from CSV promises to be most maintainable. Its slower execution time is 
partly due to the less powerful hardware on which we ran it; comparing multiple 
CSV→RDF processes running in parallel to the HBase→RDF implementation 
on the CNR Hadoop cluster seems promising. Based on these findings the 
OpenAIRE2020 LOD team will decide on the preferred approach for providing 
the OpenAIRE data as LOD; we will then make the data available for browsing 
from their OpenAIRE entity URIs, and for querying via a SPARQL endpoint. 

Having implemented almost the whole OpenAIRE data model, we will next 
interlink the output with other existing datasets. E.g., we so far output 
countries and languages as strings, whereas DBpedia and Lexvo.org are suitable 
linked open datasets for such terms. Link discovery tools will further enable 
large-scale linking against existing ‘publication’ and ‘government’ datasets. 